A Double Combinatorial Approach to Discovering Patterns in Biological Sequences
Authors
Abstract
We present an algorithm for finding degenerate common features by multiple comparison of a set of biological sequences (nucleic acids or proteins). The features of interest are words in the sequences. To locate them, the algorithm uses the concept of a model, which we introduced earlier. A model can be seen as a generalization of a consensus pattern as defined by Waterman [42]. It is an object against which the words in the sequences are compared and which serves as an identifier for the groups of similar ones. The algorithm given here innovates on our previous work in that the models are defined over what we call a weighted combinatorial cover. This is a collection of sets chosen among all possible subsets of the alphabet of nucleotides or amino acids, including the wild card, with a weight attached to each set indicating the number of times it may appear in a model. In this way, we explore both the space of models and that of alphabets. The words related to a model defined over such a combinatorial cover, and thus considered similar, are those that either belong to the model or present at most a certain number of errors with respect to a nearest element of it. We use two algorithmic ideas to deal with this double combinatorics: one concerns a left-to-right minimality of the sets composing a model, and the other involves making a sketch of the solution space before exploring it in detail.
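The notions of a weighted cover, a model, and a word related to a model can be illustrated with a minimal Python sketch. It is not the algorithm of the paper: it only checks a given model against a cover and counts errors, and it assumes that errors are substitutions only. The cover contents, the weights, and the names `COVER`, `is_valid_model`, `errors`, and `occurrences` are hypothetical and chosen for illustration.

```python
from collections import Counter

# Hypothetical weighted cover over the DNA alphabet. Each set maps to the
# number of times it may appear in a model (None means no limit).
WILD = frozenset("ACGT")  # wild card: matches any nucleotide
COVER = {
    frozenset("A"): None,
    frozenset("C"): None,
    frozenset("G"): None,
    frozenset("T"): None,
    frozenset("AG"): 1,  # purines, allowed once per model
    frozenset("CT"): 1,  # pyrimidines, allowed once per model
    WILD: 1,             # wild card, allowed once per model
}


def is_valid_model(model, cover):
    """Return True if every set of the model is in the cover and no set
    is used more often than its weight allows."""
    for subset, count in Counter(model).items():
        if subset not in cover:
            return False
        limit = cover[subset]
        if limit is not None and count > limit:
            return False
    return True


def errors(word, model):
    """Count the positions where the symbol of the word lies outside the
    corresponding set of the model (substitutions only, by assumption).
    This equals the distance to a nearest word belonging to the model."""
    if len(word) != len(model):
        raise ValueError("word and model must have the same length")
    return sum(1 for symbol, subset in zip(word, model) if symbol not in subset)


def occurrences(sequence, model, max_errors):
    """Return the start positions of the words of the sequence that are
    related to the model with at most max_errors errors."""
    k = len(model)
    return [
        i
        for i in range(len(sequence) - k + 1)
        if errors(sequence[i:i + k], model) <= max_errors
    ]


# Example model: A, then a purine, then any nucleotide, then T.
model = (frozenset("A"), frozenset("AG"), WILD, frozenset("T"))
print(is_valid_model(model, COVER))
print(occurrences("AGCTACCT", model, max_errors=1))
```

In this sketch the weights bound how degenerate a model can become, which is one way to read the role of the cover: without them, a model made only of wild cards would relate to every word.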
Similar resources
High Fuzzy Utility Based Frequent Patterns Mining Approach for Mobile Web Services Sequences
Nowadays, high fuzzy utility based pattern mining is an emerging topic in data mining. It refers to discovering all patterns whose utility meets a user-specified minimum high utility threshold. It comprises extracting patterns which are highly accessed in mobile web service sequences. Different from the traditional fuzzy approach, high fuzzy utility mining considers not only counts of mob...
Evaluation of First and Second Markov Chains Sensitivity and Specificity as Statistical Approach for Prediction of Sequences of Genes in Virus Double Strand DNA Genomes
The growing amount of information on biological sequences has made statistical approaches necessary for modeling and estimating their functions. In this paper, the sensitivity and specificity of the first and second Markov chains for gene prediction were evaluated using complete double-stranded DNA virus genomes. There were two approaches for prediction of each Markov Model parameter,...
Discovering sequence motifs of different patterns parallel using DNA operations
Discovery of motifs in biological sequences and various types of subsequences in commercial databases have varied applications and interpretations. This paper proposes a new approach to solve the Combinatorial Pattern Matching (CPM), search for continuous and gapped rigid subsequences and discover Longest Common Rigid Subsequences (LCRS) from the given sequences using DNA operations and modifie...
Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm
Motivation: The discovery of motifs in biological sequences is an important problem. Results: This paper presents a new algorithm for the discovery of rigid patterns (motifs) in biological sequences. Our method is combinatorial in nature and able to produce all patterns that appear in at least a (user-defined) minimum number of sequences, yet it manages to be very efficient by avoiding the enum...
Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm [published erratum appears in Bioinformatics 1998;14(2): 229]
Motivation: The discovery of motifs in biological sequences is an important problem. Results: This paper presents a new algorithm for the discovery of rigid patterns (motifs) in biological sequences. Our method is combinatorial in nature and able to produce all patterns that appear in at least a (user-defined) minimum number of sequences, yet it manages to be very efficient by avoiding the enum...
Journal title:
Volume, issue:
Pages: -
Publication date: 1996